129 research outputs found

    A Bayesian Approach to Learning Hidden Markov Model Topology with Applications to Biological Sequence Analysis

    Hidden Markov models (HMMs) are a widely and successfully used tool in statistical modeling and statistical pattern recognition. One fundamental problem in the application of HMMs is finding the underlying architecture or topology, particularly when there is no strong evidence from the application domain, e.g., when doing black-box modeling. Topology matters both for good parameter estimates and for performance: a model with "too many" states, and hence too many parameters, requires too much training data, while a model with "not enough" states prevents the HMM from capturing subtle statistical patterns. We have developed a novel algorithm that, given sequence data originating from an ergodic process, infers an HMM, its topology and its parameters. We introduce a Bayesian approach
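    The likelihood of the training sequences under a candidate topology is the basic quantity any Bayesian comparison of topologies has to evaluate. Below is a minimal sketch of the scaled forward algorithm for that purpose; the two-state topology, the matrices and the function name are illustrative assumptions, not the authors' model or implementation.

```python
import numpy as np

def forward_loglik(obs, start, trans, emit):
    """Log-likelihood of a discrete observation sequence under an HMM
    (scaled forward algorithm); the topology is fixed by the shapes of
    trans and emit."""
    alpha = start * emit[:, obs[0]]
    scale = alpha.sum()
    loglik = np.log(scale)
    alpha /= scale
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]   # propagate, then emit
        scale = alpha.sum()
        loglik += np.log(scale)
        alpha /= scale
    return loglik

# Illustrative two-state topology over a binary alphabet.
start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
emit  = np.array([[0.9, 0.1],
                  [0.3, 0.7]])
print(forward_loglik([0, 1, 1, 0, 1], start, trans, emit))
```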

    An Algorithm to Select Target Specific Probes for DNA Chips

    Motivation: The selection of target-specific probes is a relevant problem in the design of DNA chips. Given a set S of genomic sequences, the task is to find at least one oligonucleotide, called probe, for each target sequence in S. This probe will be attached to the chip surface and must be chosen such that it hybridizes to no sequence other than its intended target. Furthermore, all probes on the chip must hybridize to their intended targets under the same reaction conditions, most importantly at the temperature T at which the experiment is conducted. Results: We present an efficient algorithm for the probe design problem. Melting temperatures are calculated for all possible probe-target interactions using an extended nearest-neighbor model, allowing for both non-Watson-Crick base-pairing and unpaired bases within a duplex. To compute temperatures efficiently, a combination of suffix trees and dynamic-programming-based alignment algorithms is introduced. Additional filtering steps during preprocessing increase the speed of the computation. Also, an algorithm to select the actual probes from the set of candidates is presented. The practicability of the algorithms is demonstrated by two case studies: the computation of probes for the identification of different HIV-1 subtypes, and finding probes for 28S rDNA sequences from over 400 organisms
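    The melting temperatures referred to here come from the standard two-state nearest-neighbor model. The sketch below shows only the final temperature formula, assuming the stacked enthalpy and entropy sums have already been accumulated from a nearest-neighbor parameter table; the numerical values in the example are placeholders, and the mismatch and dangling-end extensions described in the abstract are not included.

```python
import math

R = 1.987  # gas constant, cal/(mol*K)

def melting_temperature(dH_kcal, dS_cal, c_total_molar, self_complementary=False):
    """Two-state nearest-neighbor melting temperature (Celsius).

    dH_kcal / dS_cal are the summed nearest-neighbor enthalpy (kcal/mol)
    and entropy (cal/(mol*K)) of the duplex; c_total_molar is the total
    strand concentration.  Salt and mismatch corrections are omitted.
    """
    x = 1.0 if self_complementary else 4.0
    tm_kelvin = (dH_kcal * 1000.0) / (dS_cal + R * math.log(c_total_molar / x))
    return tm_kelvin - 273.15

# Placeholder values; real dH/dS come from a nearest-neighbor parameter
# table applied to every dinucleotide stack of the probe-target duplex.
print(melting_temperature(dH_kcal=-60.0, dS_cal=-170.0, c_total_molar=1e-6))
```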

    SLIQ: Simple Linear Inequalities for Efficient Contig Scaffolding

    Scaffolding is an important subproblem in de novo genome assembly in which mate pair data are used to construct a linear sequence of contigs separated by gaps. Here we present SLIQ, a set of simple linear inequalities derived from the geometry of contigs on the line that can be used to predict the relative positions and orientations of contigs from individual mate pair reads and thus produce a contig digraph. The SLIQ inequalities can also filter out unreliable mate pairs and can be used as a preprocessing step for any scaffolding algorithm. We tested the SLIQ inequalities on five real data sets ranging in complexity from simple bacterial genomes to complex mammalian genomes and compared the results to the majority voting procedure used by many other scaffolding algorithms. SLIQ predicted the relative positions and orientations of the contigs with high accuracy in all cases and gave more accurate position predictions than majority voting for complex genomes, in particular the human genome. Finally, we present a simple scaffolding algorithm that produces linear scaffolds given a contig digraph. We show that our algorithm is very efficient compared to other scaffolding algorithms while maintaining high accuracy in predicting both contig positions and orientations for real data sets. Comment: 16 pages, 6 figures, 7 tables
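    For context, the majority-voting baseline mentioned above lets every mate pair linking two contigs cast one vote for a relative orientation and takes the most frequent call per contig pair. The sketch below illustrates only that baseline, not the SLIQ inequalities themselves; the input format and function name are assumptions.

```python
from collections import Counter, defaultdict

def majority_vote_orientations(mate_pair_links):
    """Baseline relative-orientation call between contig pairs.

    mate_pair_links: iterable of (contig_a, contig_b, orientation) tuples,
    where orientation is e.g. 'same' or 'opposite' as implied by one mate
    pair.  Each link is one vote; the most frequent call wins per pair.
    """
    votes = defaultdict(Counter)
    for a, b, orientation in mate_pair_links:
        votes[tuple(sorted((a, b)))][orientation] += 1
    return {pair: counts.most_common(1)[0][0] for pair, counts in votes.items()}

links = [("c1", "c2", "same"), ("c1", "c2", "same"), ("c1", "c2", "opposite")]
print(majority_vote_orientations(links))   # {('c1', 'c2'): 'same'}
```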

    CLEVER: Clique-Enumerating Variant Finder

    Next-generation sequencing techniques have facilitated large-scale analysis of human genetic variation. Despite the advances in sequencing speeds, the computational discovery of structural variants is not yet standard. It is likely that many variants have remained undiscovered in most sequenced individuals. Here we present a novel internal segment size based approach, which organizes all reads, including concordant ones, into a read alignment graph in which max-cliques represent maximal contradiction-free groups of alignments. A specifically engineered algorithm then enumerates all max-cliques and statistically evaluates them for their potential to reflect insertions or deletions (indels). For the first time in the literature, we compare a large range of state-of-the-art approaches using simulated Illumina reads from a fully annotated genome and present various relevant performance statistics. We achieve superior performance rates in particular on indels of sizes 20–100, which have been identified as a major current challenge in the SV discovery literature and where prior insert size based approaches have limitations. In that size range, we outperform even split-read aligners. We also achieve good results on real data, where, as the only tool, we make a substantial number of correct predictions that complement the predictions of split-read aligners. CLEVER is open source (GPL) and available from http://clever-sv.googlecode.com. Comment: 30 pages, 8 figures
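    The central data structure is the read alignment graph whose maximal cliques group mutually consistent alignments. As a generic point of reference only, the sketch below enumerates maximal cliques with the plain Bron-Kerbosch recursion on a toy adjacency dictionary; CLEVER's specifically engineered enumeration exploits the structure of alignment intervals and is not reproduced here.

```python
def bron_kerbosch(clique, candidates, excluded, adj, out):
    """Enumerate all maximal cliques of a graph given as an adjacency dict."""
    if not candidates and not excluded:
        out.append(set(clique))
        return
    for v in list(candidates):
        bron_kerbosch(clique | {v}, candidates & adj[v], excluded & adj[v], adj, out)
        candidates.remove(v)
        excluded.add(v)

# Toy alignment graph: an edge means two alignments are mutually
# contradiction-free (could support the same putative indel).
adj = {
    "a1": {"a2", "a3"},
    "a2": {"a1", "a3"},
    "a3": {"a1", "a2", "a4"},
    "a4": {"a3"},
}
cliques = []
bron_kerbosch(set(), set(adj), set(), adj, cliques)
print(cliques)  # two maximal cliques: {a1, a2, a3} and {a3, a4}
```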

    Partially-supervised protein subclass discovery with simultaneous annotation of functional residues

    Background: The study of functional subfamilies of protein domain families and the identification of the residues which determine substrate specificity is an important question in the analysis of protein domains. One way to address this question is the use of clustering methods for protein sequence data and approaches to predict functional residues based on such clusterings. The locations of putative functional residues in known protein structures provide insights into how different substrate specificities are reflected on the protein structure level. Results: We have developed an extension of the context-specific independence mixture model clustering framework which allows for the integration of experimental data. As these are usually known only for a few proteins, our algorithm implements a partially-supervised learning approach. We discover domain subfamilies and predict functional residues for four protein domain families: phosphatases, pyridoxal dependent decarboxylases, WW and SH3 domains, to demonstrate the usefulness of our approach. Conclusion: The partially-supervised clustering revealed biologically meaningful subfamilies even for highly heterogeneous domains and the predicted functional residues provide insights into the basis of the different substrate specificities.
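    The partially-supervised element amounts to clamping the cluster assignment of the few annotated sequences during the E-step while all other sequences are assigned freely. The sketch below shows that idea for a plain one-dimensional Gaussian mixture, which is an assumption for illustration; the paper's context-specific independence mixture over sequence data is considerably richer.

```python
import numpy as np

def partially_supervised_em(x, labels, k, n_iter=50, seed=0):
    """EM for a 1-D Gaussian mixture in which some points carry known labels.

    x: (n,) data; labels: (n,) ints giving the known component index, or -1
    for unlabeled points.  Labeled points get their responsibilities clamped
    in every E-step, which is the partially-supervised element.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    mu = rng.choice(x, size=k)                 # crude initialization
    sigma = np.full(k, x.std() + 1e-6)
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: posterior responsibilities (shared constants cancel).
        dens = weights * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
        resp = dens / dens.sum(axis=1, keepdims=True)
        for j in range(k):
            resp[labels == j] = np.eye(k)[j]   # clamp annotated points
        # M-step: weighted parameter updates.
        nk = resp.sum(axis=0)
        weights = nk / n
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
    return weights, mu, sigma

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 50), rng.normal(5, 1, 50)])
labels = np.full(100, -1)
labels[0], labels[50] = 0, 1                   # two annotated examples
print(partially_supervised_em(x, labels, k=2))
```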

    Gene expression trees in lymphoid development

    Background: The regulatory processes that govern cell proliferation and differentiation are central to developmental biology. Particularly well studied in this respect is the lymphoid system due to its importance for basic biology and for clinical applications. Gene expression measured in lymphoid cells in several distinguishable developmental stages helps in the elucidation of underlying molecular processes, which change gradually over time and lock cells in either the B cell, T cell or Natural Killer cell lineages. Large-scale analysis of these gene expression trees requires computational support for tasks ranging from visualization, querying, and finding clusters of similar genes, to answering detailed questions about the functional roles of individual genes. Results: We present the first statistical framework designed to analyze gene expression data as it is collected in the course of lymphoid development through clusters of co-expressed genes and additional heterogeneous data. We introduce dependence trees for continuous variates, which model the inherent dependencies during the differentiation process naturally as gene expression trees. Several trees are combined in a mixture model to allow inference of potentially overlapping clusters of co-expressed genes. Additionally, we predict microRNA targets. Conclusion: Computational results for several data sets from the lymphoid system demonstrate the relevance of our framework. We recover well-known biological facts and identify promising novel regulatory elements of genes and their functional assignments. The implementation of our method (licensed under the GPL) is available at http://algorithmics.molgen.mpg.de/Supplements/ExpLym/.
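    A dependence tree over continuous variates can be read as a maximum-weight spanning tree over pairwise dependence scores, in the spirit of Chow-Liu. The sketch below builds such a tree from Gaussian mutual information estimates with Prim's algorithm; the scoring, data layout and function name are assumptions for illustration and do not reproduce the paper's mixture-of-trees estimator.

```python
import numpy as np

def dependence_tree(data):
    """Chow-Liu-style dependence tree over continuous (Gaussian) variates.

    data: (n_samples, n_variates).  Edge weights are Gaussian mutual
    information estimates -0.5*log(1 - rho^2); a maximum spanning tree
    over these weights (Prim's algorithm) gives the tree of dependencies.
    """
    rho = np.corrcoef(data, rowvar=False)
    mi = -0.5 * np.log(np.clip(1.0 - rho ** 2, 1e-12, None))
    np.fill_diagonal(mi, 0.0)
    d = mi.shape[0]
    in_tree, edges = {0}, []
    while len(in_tree) < d:
        # Add the heaviest edge leaving the current tree.
        i, j, _ = max(((i, j, mi[i, j]) for i in in_tree
                       for j in range(d) if j not in in_tree),
                      key=lambda e: e[2])
        edges.append((i, j))
        in_tree.add(j)
    return edges

data = np.random.default_rng(0).normal(size=(200, 5))
data[:, 1] = data[:, 0] + 0.1 * np.random.default_rng(1).normal(size=200)
print(dependence_tree(data))   # strongly coupled variates 0 and 1 share an edge
```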

    Bayesian localization of CNV candidates in WGS data within minutes

    Background: Full Bayesian inference for detecting copy number variants (CNV) from whole-genome sequencing (WGS) data is still largely infeasible due to computational demands. A recently introduced approach to perform Forward-Backward Gibbs sampling using dynamic Haar wavelet compression has alleviated issues of convergence and, to some extent, speed. Yet, the problem remains challenging in practice. Results: In this paper, we propose an improved algorithmic framework for this approach. We provide new space-efficient data structures to query sufficient statistics in logarithmic time, based on a linear-time, in-place transform of the data, which also improves on the compression ratio. We also propose a new approach to efficiently store and update marginal state counts obtained from the Gibbs sampler. Conclusions: Using this approach, we discover several CNV candidates in two rat populations divergently selected for tame and aggressive behavior, consistent with earlier results concerning the domestication syndrome as well as experimental observations. Computationally, we observe a 29.5-fold decrease in memory, an average 5.8-fold speedup, as well as a 191-fold decrease in minor page faults. We also observe that these metrics varied greatly in the old implementation, but not the new one. We conjecture that this is due to the better compression scheme. The fully Bayesian segmentation of the entire WGS data set required 3.5 min and 1.24 GB of memory, and can hence be performed on a commodity laptop
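    Querying additive sufficient statistics over arbitrary blocks in logarithmic time is the kind of operation a generic Fenwick (binary indexed) tree also supports, and the sketch below is offered only as that generic point of comparison; the paper's structures are in-place, wavelet-aware and more space-efficient, and none of the names below come from the paper.

```python
class FenwickTree:
    """Binary indexed tree: point updates and prefix-sum queries in O(log n),
    so additive statistics over any block [lo, hi] cost two queries."""

    def __init__(self, n):
        self.tree = [0.0] * (n + 1)

    def add(self, i, value):          # 0-based index
        i += 1
        while i < len(self.tree):
            self.tree[i] += value
            i += i & -i

    def prefix_sum(self, i):          # sum of elements 0..i inclusive
        i += 1
        s = 0.0
        while i > 0:
            s += self.tree[i]
            i -= i & -i
        return s

    def range_sum(self, lo, hi):      # sum over the block [lo, hi]
        return self.prefix_sum(hi) - (self.prefix_sum(lo - 1) if lo else 0.0)

ft = FenwickTree(8)
for i, v in enumerate([1, 2, 3, 4, 5, 6, 7, 8]):
    ft.add(i, v)
print(ft.range_sum(2, 5))  # 3 + 4 + 5 + 6 = 18
```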

    Constrained mixture estimation for analysis and robust classification of clinical time series

    Motivation: Personalized medicine based on molecular aspects of diseases, such as gene expression profiling, has become increasingly popular. However, one faces multiple challenges when analyzing clinical gene expression data; most of the well-known theoretical issues such as high dimension of feature spaces versus few examples, noise and missing data apply. Special care is needed when designing classification procedures that support personalized diagnosis and choice of treatment. Here, we particularly focus on classification of interferon-β (IFNβ) treatment response in Multiple Sclerosis (MS) patients, which has attracted substantial attention in the recent past. Half of the patients remain unaffected by IFNβ treatment, which is still the standard. For them, the treatment should be ceased in a timely manner to mitigate the side effects